19.1 Inference as Optimization

Exact inference can be described as an optimization problem.

  • Goal: compute \(\log p(v; \theta)\)
  • Challenge: it is costly to marginalize out h
  • Solution: compute a lower bound on \(\log p(v; \theta)\), known as the evidence lower bound (ELBO)

Introduce q(h|v), an arbitrary distribution over h:

\[\begin{split} KL[q(h|v) || p(h|v)] &= \sum_h q(h|v) \log \frac{q(h|v)}{p(h|v)} \\ &= \sum_h q(h|v) \log \frac{q(h|v)}{p(v, h)/p(v)} \\ &= \sum_h q(h|v) \log p(v) + \sum_h q(h|v) \log q(h|v) - \sum_h q(h|v) \log p(v, h) \\ &= \log p(v) - H(q(h|v)) - E_{h\sim q(h|v)}[\log p(v, h)] \end{split}\]

Now we have

\[\log p(v) = E_{h\sim q(h|v)}[\log p(v, h)] + KL[q(h|v) || p(h|v)] + H(q)\]

Because the KL divergence is always nonnegative, we obtain the lower bound

\[L(v, \theta, q) = E_{h\sim q(h|v)}[\log p(v, h)] + H(q(h|v))\]

For appropriate choices of q, L is tractable to compute. The better q(h|v) approximates p(h|v), the tighter the lower bound, i.e. the closer L is to \(\log p(v)\). When q(h|v) = p(h|v), the approximation is perfect and \(L(v, \theta, q) = \log p(v; \theta)\).
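As a concrete check, here is a minimal numerical sketch (my own toy example, not from the text) using a hypothetical tabular joint p_joint over a few discrete states. It verifies the decomposition of \(\log p(v)\) above, confirms that L is a lower bound for an arbitrary q, and shows that the bound is tight when q(h|v) = p(h|v).

```python
# Toy verification (hypothetical numbers, not from the text) of
#   log p(v) = E_{h~q}[log p(v,h)] + KL(q || p(h|v)) + H(q)
# and of the tightness of the ELBO when q(h|v) = p(h|v).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution p(v, h) over 3 visible and 4 hidden states.
p_joint = rng.random((3, 4))
p_joint /= p_joint.sum()

v = 1                                   # condition on an observed value of v
p_v = p_joint[v].sum()                  # marginal p(v)
p_post = p_joint[v] / p_v               # exact posterior p(h | v)

# An arbitrary (imperfect) approximate posterior q(h | v).
q = rng.random(4)
q /= q.sum()

def elbo(q, p_joint_v):
    """L(v, q) = E_{h~q}[log p(v, h)] + H(q)."""
    return np.sum(q * np.log(p_joint_v)) - np.sum(q * np.log(q))

kl = np.sum(q * np.log(q / p_post))     # KL(q || p(h|v))

print(np.log(p_v), elbo(q, p_joint[v]) + kl)       # equal up to float error
print(elbo(q, p_joint[v]) <= np.log(p_v))          # True: L is a lower bound
print(np.isclose(elbo(p_post, p_joint[v]), np.log(p_v)))  # True: tight at q = p(h|v)
```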

We can thus think of inference as the procedure for finding the q that maximizes L. Strategies:

  • Exact inference maximizes L perfectly by searching over a family of distributions q that includes p(h|v) (brute force).
  • Approximate inference restricts the family of distributions q that the optimization is allowed to search over (see the sketch after this list).
  • Approximate inference can also use an imperfect optimization procedure that may not completely maximize L but merely increases it by a significant amount.
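The sketch below (again my own toy example with a hypothetical joint p_joint) combines the second and third strategies: q(h1, h2 | v) is restricted to a factorized Bernoulli family, and L is maximized only approximately by a coarse grid search over the parameters mu1 and mu2.

```python
# Restricted-family, imperfectly optimized ELBO (hypothetical toy model).
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint p(v, h1, h2) with 2 visible and 2x2 hidden states.
p_joint = rng.random((2, 2, 2))
p_joint /= p_joint.sum()

v = 0
p_v = p_joint[v].sum()                      # exact log-likelihood target

def elbo(mu1, mu2):
    """ELBO under the factorized family q(h1, h2) = Bern(mu1) x Bern(mu2)."""
    L = 0.0
    for h1, h2 in itertools.product((0, 1), repeat=2):
        q = (mu1 if h1 else 1 - mu1) * (mu2 if h2 else 1 - mu2)
        if q > 0:
            L += q * (np.log(p_joint[v, h1, h2]) - np.log(q))
    return L

# Imperfect optimization: search a coarse grid rather than solving exactly.
grid = np.linspace(0.05, 0.95, 19)
best = max(elbo(m1, m2) for m1, m2 in itertools.product(grid, grid))

print(f"best restricted ELBO: {best:.4f}  <=  log p(v): {np.log(p_v):.4f}")
```

Because the factorized q cannot represent correlations between h1 and h2, and the grid search only approximately maximizes L, the best restricted ELBO may stay strictly below \(\log p(v)\).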

Review of Shannon entropy

We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

\[H(x) = E_{x \sim P}[I(x)] = - E_{x \sim P}[\log P(x)]\]
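A quick numerical illustration (my own example) of this formula for a small discrete distribution, with the entropy measured in nats:

```python
# Shannon entropy of a toy discrete distribution P.
import numpy as np

P = np.array([0.5, 0.25, 0.25])             # a hypothetical distribution
H = -np.sum(P * np.log(P))                  # H(x) = -E_{x~P}[log P(x)]
print(H)                                    # ~1.0397 nats (i.e. 1.5 bits)
```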